Abstract

Whereas the Kohonen Self Organizing Map shows an asymptotic level
density following a power law with a magnification exponent 2/3, it would
be desirable to have an exponent 1 in order to provide optimal mapping in
the sense of information theory. In this paper, we study analytically and
numerically the magnification behaviour of the Elastic Net algorithm as a
model for self-organizing feature maps. In contrast to the Kohonen map, the
Elastic Net shows no power law, but for one-dimensional maps the density
nevertheless follows a universal magnification law, i.e. it depends on the local
stimulus density only, is independent of position, and decouples from the
stimulus density at other positions.

Self Organizing Feature Maps map an input space, such as the retina or skin
receptor fields, into a neural layer by feedforward structures with lateral inhibition.
The defining properties of biological maps are topology preservation, error tolerance,
plasticity (the ability to adapt to changes in input space), and self-organized
formation by a local process, since the global structure cannot be coded genetically.
The self-organizing feature map algorithm proposed by Kohonen [1] has become a
successful model for topology preserving primary sensory processing in the cortex
[2], and a useful tool in technical applications.

The Kohonen algorithm for Self Organizing Feature Maps is defined as follows:
Every stimulus $v$ of a Euclidean input space $V$ is mapped to the neuron with
the position $s$ in the neural layer $R$ with the highest neural activity, given by the
condition

$$|w_s - v| = \min_{r \in R} |w_r - v| \qquad (1)$$

where $|\cdot|$ denotes the Euclidean distance in input space. In the Kohonen model the
learning rule for each synaptic weight vector $w_r$ is given by

$$w_r^{\rm new} = w_r^{\rm old} + \eta \cdot g_{rs} \cdot (v - w_r^{\rm old}) \qquad (2)$$

with $g_{rs}$ as a Gaussian function of the Euclidean distance $|r - s|$ in the neural layer.
The function $g_{rs}$ describes the topology in the neural layer. The parameter $\eta$
determines the speed of learning and can be adjusted during the learning process.
Topology preservation is enforced by the common update of all weight vectors
whose neuron $r$ is adjacent to the center of excitation $s$.
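The two steps of the Kohonen algorithm, winner selection (1) and the neighbourhood-weighted update (2), can be sketched for a one-dimensional chain with scalar weights. This is a minimal illustration rather than the implementation used for this paper: the function name `kohonen_step`, the neighbourhood width parameter `width` and the unnormalized Gaussian chosen for $g_{rs}$ are assumptions.

```python
import math

def kohonen_step(w, v, eta=0.1, width=2.0):
    """One Kohonen update for a 1-D chain; w is a list of scalar weights."""
    # Winner s: the neuron whose weight is closest to the stimulus v, eq. (1).
    s = min(range(len(w)), key=lambda r: abs(w[r] - v))
    for r in range(len(w)):
        # Gaussian neighbourhood g_rs in the neural layer (assumed unnormalized).
        g = math.exp(-((r - s) ** 2) / (2.0 * width ** 2))
        # Learning rule, eq. (2): move w_r towards v, weighted by g_rs.
        w[r] += eta * g * (v - w[r])
    return s

# Example: five neurons on the unit interval, one stimulus at v = 0.6.
w = [0.0, 0.25, 0.5, 0.75, 1.0]
s = kohonen_step(w, v=0.6, eta=0.5, width=1.0)
```

All neurons move towards the stimulus, but the step size decays with the layer distance $|r - s|$ from the winner, which is what couples neighbouring weight vectors.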
1 The Elastic Net Feature Map

The Elastic Net was proposed for solving optimization problems like the famous
Travelling Salesman Problem (TSP). Here we apply this concept to feature maps. The
Elastic Net is defined as a gradient descent in the energy landscape

$$E = -\sigma^2 \sum_\mu \ln \sum_r e^{-(v^\mu - w_r)^2/2\sigma^2} + \frac{\kappa}{2} \sum_r |w_{r+1} - w_r|^2 \qquad (3)$$

with the input vectors denoted by $v^\mu$. Here $r$ is the index of the neurons in a
one-dimensional array (for the TSP: with periodic boundary conditions), and $w_r$
is the synaptic weight vector of that neuron. For $\sigma \to 0$, (3) becomes

$$\lim_{\sigma \to 0} E = \frac{1}{2} \sum_\mu (v^\mu - w_{s(v^\mu)})^2 + \frac{\kappa}{2} \sum_r |w_{r+1} - w_r|^2. \qquad (4)$$

Here $s(v^\mu)$ denotes the neuron with the smallest distance to the stimulus, the
winning neuron, which is assumed to be nondegenerate. A gradient descent in the
first term (which can be interpreted as an entropy term) leads, for sufficiently
small $\sigma$, to the condensation of (at least) one weight vector to each input vector, if
the input space is discrete. The second term is the potential energy of an elastic
string between the weight vectors, and gradient descent in this term leads to a
minimization of the (squared!) distances between the weight vectors.
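The relation between the energy (3) and its small-$\sigma$ limit (4) can be checked numerically. The following is a minimal sketch under stated assumptions: scalar weights, a finite list of stimuli, an open chain without periodic boundary conditions, and the hypothetical function names `elastic_net_energy` and `winner_energy`.

```python
import math

def elastic_net_energy(w, stimuli, sigma, kappa):
    """Energy (3) for scalar weights w and a finite list of stimuli."""
    e = 0.0
    for v in stimuli:
        d2 = [(v - wr) ** 2 / (2.0 * sigma ** 2) for wr in w]
        m = min(d2)
        # log-sum-exp with the smallest exponent factored out, for stability
        e -= sigma ** 2 * (math.log(sum(math.exp(m - d) for d in d2)) - m)
    # elastic string energy between neighbouring weights (open chain assumed)
    e += 0.5 * kappa * sum((w[r + 1] - w[r]) ** 2 for r in range(len(w) - 1))
    return e

def winner_energy(w, stimuli, kappa):
    """sigma -> 0 limit (4): squared distance to the winning neuron."""
    e = sum(0.5 * min((v - wr) ** 2 for wr in w) for v in stimuli)
    e += 0.5 * kappa * sum((w[r + 1] - w[r]) ** 2 for r in range(len(w) - 1))
    return e

w = [0.0, 0.5, 1.0]
stimuli = [0.1, 0.45, 0.9]
gap = elastic_net_energy(w, stimuli, 0.01, 0.2) - winner_energy(w, stimuli, 0.2)
```

For $\sigma$ much smaller than the spacing of the weights, `gap` is negligible. For larger $\sigma$ the energy (3) lies below (4), because the neurons other than the winner also contribute to the logarithm.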

Depending on parameter adjustment [7], a gradient descent in $E$ can provide
near-optimal solutions to the TSP within polynomial processing time, similar to
the Kohonen algorithm [3]. We remark that in the Travelling Salesman application
(if the numbers of neurons and cities are chosen to be equal) both the Elastic Net
and the Kohonen algorithm share the same zero-order and first-order [9] terms and are
therefore related for the final state of convergence, although their initial ordering
process is different.

The update rule of the Elastic Net Algorithm is the gradient descent in $E$:

$$\frac{1}{\eta} \delta w_r = \sum_\mu (v^\mu - w_r) \frac{e^{-(v^\mu - w_r)^2/2\sigma^2}}{\sum_{r'} e^{-(v^\mu - w_{r'})^2/2\sigma^2}} + \kappa \Delta w_r, \qquad (5)$$

where $\Delta w_r = w_{r-1} - 2 w_r + w_{r+1}$ denotes the discrete Laplacian.
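That the batch rule (5) is the negative gradient of the energy (3) can be verified with finite differences. The sketch below assumes scalar weights and periodic boundary conditions (as in the TSP setting); the function names `responsibilities`, `batch_update` and `energy` are hypothetical.

```python
import math

def responsibilities(w, v, sigma):
    """Normalized Gaussian activations of all neurons for stimulus v."""
    d2 = [(v - wr) ** 2 / (2.0 * sigma ** 2) for wr in w]
    m = min(d2)
    a = [math.exp(m - d) for d in d2]
    z = sum(a)
    return [ai / z for ai in a]

def batch_update(w, stimuli, sigma, kappa):
    """Right-hand side of (5), i.e. delta w_r / eta, periodic chain assumed."""
    n = len(w)
    dw = [0.0] * n
    for v in stimuli:
        for r, a in enumerate(responsibilities(w, v, sigma)):
            dw[r] += (v - w[r]) * a
    for r in range(n):
        # discrete Laplacian; w[r - 1] wraps around for r = 0
        dw[r] += kappa * (w[r - 1] - 2.0 * w[r] + w[(r + 1) % n])
    return dw

def energy(w, stimuli, sigma, kappa):
    """Energy (3) with the same periodic boundary conditions."""
    n = len(w)
    e = 0.0
    for v in stimuli:
        d2 = [(v - wr) ** 2 / (2.0 * sigma ** 2) for wr in w]
        m = min(d2)
        e -= sigma ** 2 * (math.log(sum(math.exp(m - d) for d in d2)) - m)
    e += 0.5 * kappa * sum((w[(r + 1) % n] - w[r]) ** 2 for r in range(n))
    return e
```

Comparing `batch_update` with a centred difference quotient of `energy` in each weight gives agreement up to the discretization error of the difference quotient.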

If we apply this concept to feature maps, we have to replace the sum over
all input vectors by an integral $\int P(v)\,dv$ over a probability density. If we
interpret (5) as a neural feature mapping algorithm, it is a pattern parallel learning
rule, or batch update rule, where the contributions of all patterns are summed up
into one update term. In the brain, however, patterns are presented serially in a
stochastic sequence. Therefore we generalize this algorithm to serial presentation:

$$\frac{1}{\eta} \delta w_r = (v - w_r) \frac{e^{-(v - w_r)^2/2\sigma^2}}{\sum_{r'} e^{-(v - w_{r'})^2/2\sigma^2}} + \kappa \Delta w_r. \qquad (6)$$
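A single serial presentation step of the Elastic Net can be sketched as follows. The assumptions are scalar weights, a synchronous update of all neurons per stimulus, and end neurons that are kept fixed (as in the numerical section of this paper); the function name `serial_step` is hypothetical.

```python
import math

def serial_step(w, v, eta, sigma, kappa):
    """One serial Elastic Net step for stimulus v on a chain with fixed ends."""
    d2 = [(v - wr) ** 2 / (2.0 * sigma ** 2) for wr in w]
    m = min(d2)
    a = [math.exp(m - d) for d in d2]   # Gaussian activations, rescaled for stability
    z = sum(a)                          # normalization over all neurons
    new_w = list(w)
    for r in range(1, len(w) - 1):
        lap = w[r - 1] - 2.0 * w[r] + w[r + 1]   # discrete Laplacian
        new_w[r] = w[r] + eta * ((v - w[r]) * a[r] / z + kappa * lap)
    return new_w

w = [0.0, 0.25, 0.5, 0.75, 1.0]
new_w = serial_step(w, v=0.5, eta=0.5, sigma=0.2, kappa=0.1)
```

In contrast to the Kohonen step, the attraction towards the stimulus is weighted by the distance in input space, and the only coupling in the neural layer is the nearest-neighbour elastic term.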
In Monte Carlo simulations of this model, one chooses input vectors $v$ according
to the probability density function $P(v)$ and updates $w_r$ for every neuron $r$
in the neural layer according to (6). The algorithm can be viewed as a stochastic
approximation algorithm that converges if the conditions $\sum_{t=0}^{\infty} \eta^2(t) < \infty$ and
$\sum_{t=0}^{\infty} \eta(t) = \infty$ for the time development of the parameter $\eta$ are fulfilled [10]. The
simultaneous adjustment of $\kappa$ and $\sigma$ has been discussed for the special
case of the TSP optimization problem. For the TSP it appears necessary to adjust
$\kappa/\sigma$ to a system-size-dependent value to avoid 'spike defects' for small $\kappa/\sigma$ and
'frozen bead defects' for large $\kappa/\sigma$ when annealing $\sigma \to 0$. Neither 'defect' is a
defect in feature maps; the 'spike defects' can only occur for delta-peaked stimuli
(cities) together with a dimension reduction.
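The two convergence conditions on the learning rate can be illustrated with a hyperbolic decay. The schedule $\eta(t) = \eta_0 / (1 + t/\tau)$, its parameters and the function names below are assumptions chosen for illustration; they are not taken from the simulations of this paper.

```python
def eta_schedule(t, eta0=0.5, tau=1000.0):
    """Hypothetical learning-rate decay eta(t) = eta0 / (1 + t / tau)."""
    return eta0 / (1.0 + t / tau)

def partial_sums(steps, eta0=0.5, tau=1000.0):
    """Partial sums of eta(t) and eta(t)^2 over t = 0 .. steps - 1."""
    s1 = sum(eta_schedule(t, eta0, tau) for t in range(steps))
    s2 = sum(eta_schedule(t, eta0, tau) ** 2 for t in range(steps))
    return s1, s2

s1_short, s2_short = partial_sums(10_000)
s1_long, s2_long = partial_sums(100_000)
```

For this schedule the sum of $\eta(t)$ keeps growing logarithmically with the number of steps, while the sum of $\eta^2(t)$ stays below $\eta_0^2 (\tau + 1)$, so both conditions are met.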

The aim in feature maps is different. Using the Kohonen algorithm, one tries to
start with a long-range interaction in the neural layer to avoid global topological
defects. This is not directly possible for the Elastic Net, as its learning cooperation
is restricted to nearest neighbours. Only the strength $\kappa$ of the elastic spring can be
initialized with a high value and decreased after global ordering. The parameter
$\sigma$ is to be interpreted as a resolution length in feature space, e.g. the distance
between two receptors in skin or retina. For selectivity of the winner-take-all
mechanism, one would choose $\sigma$ smaller than or comparable to the average or minimal distance
between adjacent weight vectors.

2 Asymptotic Density and the Magnification Factor

In this paper we consider the case of continuously distributed input spaces with the
same dimensionality as the neural layer, so there is no reduction of dimension.

The magnification factor is defined as the density of neurons $r$ (i.e. the density
of synaptic weight vectors $w_r$) per unit volume of input space, and is therefore
given by the inverse Jacobian of the mapping from input space to neural layer:
$M = |J|^{-1} = |\det(dw/dr)|^{-1}$. (In the following we consider the case of non-inverting
mappings, where $J$ is positive.) The magnification factor is a property of
the network's response to a given probability density of stimuli $P(v)$. To evaluate
$M$ in higher dimensions, one in general has to compute the equilibrium state of
the whole network using global knowledge of $P(v)$.
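For a one-dimensional chain the magnification factor can be estimated directly from the weights by a finite difference. The following sketch assumes unit spacing of the neuron index $r$ and uses a centred difference for $J = dw/dr$; the function name `magnification` is hypothetical.

```python
def magnification(w):
    """M = 1 / |dw/dr| at the interior neurons, via centred differences."""
    return [2.0 / abs(w[r + 1] - w[r - 1]) for r in range(1, len(w) - 1)]

# Example map w(r) = r^2 / 100 on r = 0..10, for which dw/dr = r / 50.
w = [r ** 2 / 100.0 for r in range(11)]
m = magnification(w)
```

For this quadratic example the centred difference is exact, so the estimate at neuron $r$ equals $50/r$: weights that lie close together in input space correspond to a large magnification.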

For one-dimensional mappings (and possibly for special geometric cases in
higher dimensions) the magnification factor may follow a universal magnification
law, i.e. $M(w(r))$ is a function only of the local probability density $P$ and
independent of both the location $r$ in the neural layer and $w(r)$ in input space.

An optimal map from the view of information theory would reproduce the
input probability exactly ($M \sim P(v)^\rho$ with $\rho = 1$), according to a power law with
exponent 1, which is equivalent to all neurons in the layer firing with the same probability. An
algorithm maximizing mutual information has been given by Linsker [11].
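The equivalence between exponent 1 and equal firing probability can be made explicit in one dimension. As a short worked step, assuming a non-inverting map $w(r)$ with $J = dw/dr = 1/M$, the probability that the winning neuron lies in a layer interval $dr$ is

```latex
P(w)\,dw \;=\; P(w(r))\,J(r)\,dr \;=\; \frac{P(w(r))}{M(w(r))}\,dr .
```

This is independent of $r$ exactly when $M \propto P$, i.e. for $\rho = 1$. For any other exponent, neurons in regions of high stimulus density fire more often (for $\rho < 1$) or less often (for $\rho > 1$) than the others.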

For the classical Kohonen algorithm the magnification law (for one-dimensional
mappings) is a power law $M(w(r)) \propto P(w(r))^\rho$ with exponent $\rho = 2/3$.
For a discrete neural layer and especially for neighborhood kernels with different
shape and range there are corrections to the magnification law.



J.C. Claussen and H.G. Schuster



3 Magnification Exponent of the Elastic Net 

The necessary condition for the final state of algorithm (6) is that for all neurons
$r$ the expectation value of the learning step vanishes:

$$0 = \int dv\, P(v)\, \delta w_r(v). \qquad (7)$$

Since this expectation value is equal to the learning step of the pattern parallel
rule (5), equation (7) is the stationary state condition for both serial and parallel
updating. Inserting the learning rule (6) into condition (7), we obtain for the
invariant density $w_r$ in the one-dimensional case:

$$0 = \int \left( (v - w_r) \frac{e^{-(v - w_r)^2/2\sigma^2}}{\sum_{r'} e^{-(v - w_{r'})^2/2\sigma^2}} + \kappa \Delta w_r \right) P(v)\, dv.$$

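In the limit of a continuous neural layer, for every stimulus $v$ there exists one
unique center of excitation $s$ with $v = w_s$. Thus we can substitute the integration over
$dv$ by an integration over $ds$. Using the Jacobian $J(s) := dw(s)/ds$, we have

$$0 = \int \left( (w(s) - w(r)) \frac{e^{-(w(s) - w(r))^2/2\sigma^2}}{\int e^{-(w(s) - w(r'))^2/2\sigma^2}\, dr'} + \kappa \Delta w(r) \right) P(w(s))\, J(s)\, ds.$$

The second term becomes $\kappa\, d^2 w/dr^2 = \kappa\, dJ/dr$. The normalization integral is ($\rho := s - r'$):

$$\int e^{-(w(s) - w(r'))^2/2\sigma^2}\, dr' = \int e^{-\rho^2/2(\sigma/J(s))^2}\, d\rho + o(\sigma^3) = \sqrt{2\pi}\, \frac{\sigma}{J(s)} + o(\sigma^3).$$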
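The leading-order value $\sqrt{2\pi}\,\sigma/J(s)$ of the normalization integral can be checked numerically for a smooth monotonic map. The map $w(r) = r + 0.1 \sin r$, the trapezoidal quadrature and the function names below are assumptions chosen only for this check.

```python
import math

def w(r):
    """Hypothetical smooth, monotonic map from the layer to input space."""
    return r + 0.1 * math.sin(r)

def J(r):
    """Jacobian dw/dr of the hypothetical map above."""
    return 1.0 + 0.1 * math.cos(r)

def normalization(s, sigma, half_width=5.0, n=20001):
    """Trapezoidal estimate of the integral of exp(-(w(s)-w(r'))^2 / 2 sigma^2) dr'."""
    lo = s - half_width
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        rp = lo + i * h
        f = math.exp(-((w(s) - w(rp)) ** 2) / (2.0 * sigma ** 2))
        total += f * (0.5 if i in (0, n - 1) else 1.0)   # half weight at the ends
    return total * h

sigma, s = 0.05, 1.0
ratio = normalization(s, sigma) / (math.sqrt(2.0 * math.pi) * sigma / J(s))
```

For small $\sigma$ the ratio is close to one, and the deviation shrinks further as $\sigma$ is reduced, in line with the saddlepoint estimate.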

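For the following equations, we define the abbreviation $P(r) := P(w(r))$. Using
parametric differentiation, the substitution $ds = dw_s/(dw_s/ds) = dw_s/J(s)$, and
saddlepoint expansion (method of steepest descent) for $\sigma \to 0$, the first integral
becomes (after Simic [15]):

$$\frac{1}{\sqrt{2\pi}\,\sigma} \int (w(s) - w(r))\, e^{-(w(s) - w(r))^2/2\sigma^2}\, P(w(s))\, J(s)^2\, ds$$
$$= \frac{\sigma}{\sqrt{2\pi}} \frac{1}{J(r)} \frac{d}{dr} \int e^{-(w(s) - w(r))^2/2\sigma^2}\, P(w(s))\, J(s)^2\, ds$$
$$= \frac{\sigma}{\sqrt{2\pi}} \frac{1}{J(r)} \frac{d}{dr} \int e^{-(w(s) - w(r))^2/2\sigma^2}\, P(w(s))\, J(w(s))\, dw(s)$$
$$= \sigma^2 \frac{1}{J(r)} \frac{d}{dr} \big( P(r) J(r) \big) + o(\sigma^4) = \sigma^2 \left( \frac{dP}{dr} + \frac{P}{J} \frac{dJ}{dr} \right) + o(\sigma^4). \qquad (8)$$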

Neglecting higher orders of $\sigma$, we obtain

$$0 = \sigma^2 \frac{1}{J(r)} \frac{d}{dr} \big( P(r) J(r) \big) + \kappa \frac{dJ}{dr}. \qquad (9)$$

This is a first-order nonlinear differential equation for $J(r)$ for a given input density
$P(r)$. However, it can be expressed explicitly only if (in addition to $P(v)$) the
complete equilibrium state $w(r)$ is known, and then one obtains $J(r)$ directly by
evaluating the first derivative. Thus the differential equation gives further
insight only if $J(r)$ follows a universal scaling law without explicit dependence
on the location $r$, that is, if $J$ is a function of $P$ only.
The ansatz $J(r) = J(P(r))$ leads, for all $r$ where $dP/dr \neq 0$, to the differential
equation for the invariant state of the one-dimensional Elastic Net Algorithm

$$\frac{dJ}{dP} = -\frac{J}{P} \cdot \frac{1}{1 + \frac{\kappa}{\sigma^2} \frac{J}{P}}. \qquad (10)$$

The first derivative depends only on $J/P$. The gradient field of (10) has two
regimes: For $\kappa/\sigma^2 \to 0$ ('soft string tension'), $dJ/dP = -J/P$, therefore
$M = J^{-1} \sim P(v)^1$. The magnification exponent is asymptotically 1 and the cortical
representation is close to the optimum given by information theory. For $\kappa/\sigma^2 \to \infty$
('hard string tension'), $dJ/dP \to 0$, therefore $M = J^{-1}$ has a constant value. Here
all adaptation to the stimuli vanishes, equivalent to a magnification exponent of
zero.
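The two regimes can be reproduced by integrating the differential equation for $J(P)$ numerically. The sketch below works in logarithmic variables, where the equation reads $d\ln J / d\ln P = -1/(1 + c\,J/P)$ with $c = \kappa/\sigma^2$, and uses a classical Runge-Kutta scheme. The integration range, the initial value and the function names are assumptions chosen for illustration.

```python
import math

def integrate_ln_j(c, x0, x1, ln_j0, n=2000):
    """RK4 for d(ln J)/d(ln P) = -1 / (1 + c * J / P), with c = kappa / sigma^2."""
    def f(x, y):
        return -1.0 / (1.0 + c * math.exp(y - x))
    h = (x1 - x0) / n
    x, y = x0, ln_j0
    for _ in range(n):
        k1 = f(x, y)
        k2 = f(x + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(x + 0.5 * h, y + 0.5 * h * k2)
        k4 = f(x + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        x += h
    return y

def mean_exponent(c, p0=0.1, p1=1.0, j0=1.0):
    """Average slope of ln M = -ln J against ln P between p0 and p1."""
    x0, x1 = math.log(p0), math.log(p1)
    ln_j1 = integrate_ln_j(c, x0, x1, math.log(j0))
    return -(ln_j1 - math.log(j0)) / (x1 - x0)
```

For very small `c` the mean exponent approaches 1, for very large `c` it approaches 0, and for intermediate values it lies in between.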

Substituting $X := \ln P$, $Y := -\ln J$ and $Z := X + Y$, (10) can be solved
exactly (see Fig. 1):

$$\ln M = \frac{1}{2} \left( \ln(PM) + \ln\left( 1 + \frac{\kappa}{2\sigma^2} \frac{1}{PM} \right) \right) + \mathrm{const.} \qquad (11)$$

Figure 1: Solutions of equation (11) for [...] 1, 1/2 (middle) and 1/4.



Thus the magnification exponent depends only on the local input probability
density, $M \sim P^{\rho(P)}$, and we have $\rho_0 = dY/dX = \rho + X\, d\rho(X)/dX$, where $\rho = \rho_0$ for limiting
cases with $d\rho(X)/dX \to 0$. For increasing $\kappa$ the magnification exponent shifts from 1
to zero according to equation (11), rewritten as

$$\rho_0 = \frac{dY}{dX} = \frac{1}{1 + \frac{\kappa}{\sigma^2} \frac{1}{PM}}. \qquad (12)$$
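The implicit solution for $\ln M$ and the local exponent $1/(1 + \kappa/(\sigma^2 P M))$ can be cross-checked numerically. Exponentiating the implicit solution gives the quadratic $M^2 = A\,(PM + c/2)$ with $c = \kappa/\sigma^2$ and $A = e^{2\,\mathrm{const.}}$; this closed form is derived here for the check and is not stated in the paper. The function names are hypothetical.

```python
import math

def m_of_p(p, c, const=0.0):
    """Positive root of M^2 = A (P M + c/2), with A = exp(2 const)."""
    a = math.exp(2.0 * const)
    return 0.5 * (a * p + math.sqrt(a * a * p * p + 2.0 * a * c))

def local_exponent(p, c, const=0.0, eps=1e-6):
    """Centred finite difference of ln M with respect to ln P."""
    up = math.log(m_of_p(p * math.exp(eps), c, const))
    dn = math.log(m_of_p(p * math.exp(-eps), c, const))
    return (up - dn) / (2.0 * eps)

def exponent_closed_form(p, c, const=0.0):
    """Local exponent 1 / (1 + c / (P M)) evaluated on the same solution."""
    return 1.0 / (1.0 + c / (p * m_of_p(p, c, const)))
```

Both functions agree to within the accuracy of the finite difference, and the exponent moves from values near 1 for $c \ll PM$ to values near 0 for $c \gg PM$.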



Finally we remark that the decomposition (6) of the parallel update rule into
update responses to the individual stimuli is not unique. In particular, the elastic term can be
decomposed in a suitable stimulus-dependent manner so that elasticity is applied
only in the vicinity of the stimulus. This Local Elastic Net reads

$$\delta w_r = \eta \left\{ A^{\sigma}(v, w_r) \cdot (v - w_r) + \kappa \left( (1 - \nu) \cdot A^{(a\sigma)}(v, w_r) + \nu \right) \cdot \Delta w_r \right\},$$

where $A$ is a normalized Gaussian function of distance, $a \approx 1$ and $0 < \nu < 1$. A
small global elasticity (e.g. $\nu = 0.05$) smoothes fluctuations, but the "forgetting"
due to global relaxation is reduced, which improves convergence. The magnification
law of the Local Elastic Net is similar to that of the Elastic Net.
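One update step of the Local Elastic Net described above can be sketched as follows. Several choices here are assumptions, not statements of the paper: the Gaussian $A$ is normalized to sum to one over the neurons, its width for the elastic part is `a * sigma`, the end neurons are kept fixed, and the function names are hypothetical.

```python
import math

def gaussian_weights(w, v, width):
    """Gaussian of |v - w_r|, normalized to sum to one over the neurons (assumption)."""
    d2 = [(v - wr) ** 2 / (2.0 * width ** 2) for wr in w]
    m = min(d2)
    g = [math.exp(m - d) for d in d2]
    z = sum(g)
    return [gi / z for gi in g]

def local_elastic_step(w, v, eta, sigma, kappa, a=1.0, nu=0.05):
    """One Local Elastic Net step; the end neurons stay fixed (assumption)."""
    act = gaussian_weights(w, v, sigma)       # attraction towards the stimulus
    loc = gaussian_weights(w, v, a * sigma)   # localizes the elastic term
    new_w = list(w)
    for r in range(1, len(w) - 1):
        lap = w[r - 1] - 2.0 * w[r] + w[r + 1]
        elastic = kappa * ((1.0 - nu) * loc[r] + nu) * lap
        new_w[r] = w[r] + eta * (act[r] * (v - w[r]) + elastic)
    return new_w
```

With `nu=1.0` the localization drops out and the step reduces to the serial Elastic Net step, while small `nu` leaves only a weak global relaxation away from the stimulus.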
4 Numerical Verification of the Magnification Law

To calculate the asymptotic level density numerically, we considered the map of
the unit interval to a one-dimensional neural chain of 100 neurons with fixed first
and last neuron. The learning rate was 0.5. The stimulus probability density was
chosen exponentially as $\exp(-\beta w)$ with $\beta = 4$. After an adaptation process of
5 · 10^[...] steps, a further 10% of learning steps were used to calculate the average slope
(and its fluctuation, shown in brackets) of $\log J$ as a function of $\log P$. (The first
and last 10% of neurons were excluded to eliminate boundary effects.) The (local)
magnification exponents were obtained as



$\sigma$ \ $\kappa$    [...]
[...]    0.03 (0.01)    0.23 (0.01)    0.49 (0.01)
0.01     0.03 (0.01)    0.25 (0.01)    0.77 (0.02)    0.96 (0.06)
0.03     0.23 (0.01)    0.70 (0.03)







For the Elastic Net the parameter choice appeared crucial: As in the TSP
application, choosing $\sigma$ as the average distance (in input space)
between two adjacent neurons seems to be appropriate. For larger $\sigma$, clustering
phenomena clearly appear because too many neurons fall in the Gaussian
neighborhood of the stimulus. For large $\kappa/\sigma^2$ the exponent decreases to zero,
as given by the theory. For small $\kappa/\sigma^2$ the exponent first increases to near 1, but
simultaneously an instability due to clustering arises (last row).
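The numerical procedure described in this section can be sketched in a few lines. This is an illustration under stated assumptions and not the code used for the results above: the chain starts from an ordered, equally spaced configuration, all neurons are updated synchronously per stimulus, the default number of steps is far smaller than in the paper, stimuli are drawn by inverse-CDF sampling, and the exponent is obtained from a least-squares fit of $\ln M$ against $\ln P$. All function names are hypothetical.

```python
import math
import random

def simulate(n=100, sigma=0.01, kappa=0.01, eta=0.5, beta=4.0,
             steps=20000, seed=0):
    """Serial Elastic Net on the unit interval with fixed end neurons."""
    rng = random.Random(seed)
    w = [r / (n - 1) for r in range(n)]       # ordered start (assumption)
    norm = 1.0 - math.exp(-beta)
    for _ in range(steps):
        # inverse-CDF sample from P(v) proportional to exp(-beta v) on [0, 1]
        v = -math.log(1.0 - rng.random() * norm) / beta
        d2 = [(v - wr) ** 2 / (2.0 * sigma ** 2) for wr in w]
        m = min(d2)
        a = [math.exp(m - d) for d in d2]
        z = sum(a)
        new_w = list(w)
        for r in range(1, n - 1):
            lap = w[r - 1] - 2.0 * w[r] + w[r + 1]
            new_w[r] = w[r] + eta * ((v - w[r]) * a[r] / z + kappa * lap)
        w = new_w
    return w

def exponent_estimate(w, beta=4.0, cut=0.1):
    """Least-squares slope of ln M against ln P over the central neurons."""
    n = len(w)
    lo, hi = max(1, int(cut * n)), min(n - 1, int((1.0 - cut) * n))
    xs, ys = [], []
    for r in range(lo, hi):
        j = 0.5 * (w[r + 1] - w[r - 1])       # centred estimate of J = dw/dr
        if j > 0.0:                           # skip locally folded stretches
            xs.append(-beta * w[r])           # ln P up to an additive constant
            ys.append(-math.log(j))           # ln M = -ln J
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
```

Calling `exponent_estimate(simulate())` gives a single noisy estimate from one final configuration. Averaging the slope over many late learning steps, as described above, would be needed to obtain values with meaningful error bars.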

Whereas the simulation validates the exact result, appropriate adjustment of
$\kappa/\sigma^2$ between optimal mapping and stability remains difficult and becomes
intractable for large-scale variations of the input probability density.